fix(shield): prevent stuck allowlist-waiter Job after hook failure by francesco-furlan · Pull Request #2619 · sysdiglabs/charts

francesco-furlan · 2026-05-12T14:46:39Z

What this PR does / why we need it:

Fixes a stuck-cluster state observed on a customer running shield-1.37.1 on GKE Autopilot via Flux: when the pre-install shield-host-allowlist-waiter Job times out or fails, Helm's hook executor (per helm/helm pkg/action/hooks.go) applies the HookSucceeded delete-policy to all previously-succeeded hooks in the batch, sweeping the waiter's SA/CR/CRB while the failed Job survives. The Job then loops FailedCreate forever against a missing ServiceAccount.

Note on history
This PR originally also decoupled the waiter SA/CR/CRB from host.rbac.create via a dedicated gke_autopilot.allowlist_waiter.create_rbac flag. That work has been superseded by #2633, which landed a richer per-resource RBAC schema (gke_autopilot.allowlist_waiter.rbac.{create,service_account.create,cluster_role.create}). The original commit was dropped during the rebase onto main; this PR no longer introduces RBAC plumbing of its own.

Three logically separable changes, one commit each (kept independently bisectable), on top of main's RBAC sub-toggle work:

The core fix:
- Drop hook-succeeded from the SA/CR/CRB helm.sh/hook-delete-policy so they survive Helm's HookSucceeded sweep on Job failure.
- Add hook-failed to the Job's helm.sh/hook-delete-policy so a failed Job is reaped instead of looping FailedCreate.
- AllowlistSynchronizer hook policy unchanged (already before-hook-creation only — it must persist past the waiter).
Emit kubectl describe and full YAML of the AllowlistSynchronizer on wait failure, so the next on-caller has actionable diagnostics instead of a bare exit (the customer's first failure was unrecoverable because pod logs were already evicted by the time investigation started). No new RBAC needed.
Add gke_autopilot.allowlist_waiter.active_deadline_seconds (default 300) wired into Job.spec.activeDeadlineSeconds. Belt-and-suspenders Job-level guard against pod hangs before the inner kubectl wait timeout fires (image-pull stalls, scheduler delays, admission webhook hangs).

Checklist

Title of the PR starts with type and scope
Chart Version bumped for the respective charts (1.39.0 → 1.40.0)
Variables are documented in the README.md
Check GithubAction checks (like lint) to avoid merge-check stoppers
All test files are added in the tests folder of their respective chart and have a "_test" suffix

…k failure Helm 3's hook executor applies the HookSucceeded delete-policy to both the failed hook and all previously-succeeded hooks in the batch (see helm/helm pkg/action/hooks.go execHook). When the allowlist-waiter Job (weight 5) fails, Helm sweeps the SA/CR/CRB (weight -5) because they carried hook-succeeded — leaving the failed Job referencing a missing SA, which produces a FailedCreate retry storm bounded only by manual intervention. Two surgical changes: * Drop hook-succeeded from the SA/CR/CRB delete-policy. They now persist across failures (and across successes — a lingering tiny SA/CR/CRB that the next install's before-hook-creation reclaims). * Add hook-failed to the Job's delete-policy so a failed Job is reaped instead of looping forever against a deleted SA. The AllowlistSynchronizer is unchanged (already before-hook-creation only — it must persist after the waiter exits since host-shield pods rely on Warden seeing it).

When the waiter's kubectl wait times out, the Pod previously exited with just the bare wait error, leaving no diagnostics by the time the next on-caller arrived (especially after kubelet log eviction). On non-zero exit, the script now emits a kubectl describe and full YAML of the AllowlistSynchronizer to stderr before exiting with the original error code. The existing RBAC (get/list/watch on allowlistsynchronizers) is sufficient; events lookup is intentionally skipped to avoid widening the ClusterRole.

…econds The Job's only existing timeout lives inside the container script (kubectl wait --timeout). If the Pod hangs before that line executes (image-pull stall, scheduler delay, admission webhook latency), the Job can outrun Helm's pre-install hook timeout. Add a Job-level activeDeadlineSeconds guard via the new gke_autopilot.allowlist_waiter.active_deadline_seconds value, defaulting to 300 (180s of headroom over the default 120s wait timeout).

francesco-furlan requested a review from a team as a code owner May 12, 2026 14:46

francesco-furlan added 4 commits May 25, 2026 10:34

docs: update

daf64fb

francesco-furlan force-pushed the shield/fix-allowlist-waiter-hook-cleanup branch from d53d85f to daf64fb Compare May 25, 2026 08:38

AlbertoBarba approved these changes May 25, 2026

View reviewed changes

francesco-furlan merged commit a3329c9 into main May 25, 2026
8 checks passed

francesco-furlan deleted the shield/fix-allowlist-waiter-hook-cleanup branch May 25, 2026 09:09

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(shield): prevent stuck allowlist-waiter Job after hook failure#2619

fix(shield): prevent stuck allowlist-waiter Job after hook failure#2619
francesco-furlan merged 4 commits into
mainfrom
shield/fix-allowlist-waiter-hook-cleanup

francesco-furlan commented May 12, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

francesco-furlan commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this PR does / why we need it:

Checklist

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

francesco-furlan commented May 12, 2026 •

edited

Loading